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A novel class of algorithms for restoring a function from a random sample is based on the concept of 
weak convergence, borrows algorithmic solutions from the Optimal Jet Finder (hep-ph/030 1 1 85), of- 
fers a considerable algorithmic flexibility, is applicable to non-positive functions, is insensitive to the 
choice of coordinate axes. A first implementation demonstrates feasibility of the approach. 



Small random raindrops 
can hardly hurt pouzyry. 
Oh! joy of research 

The fundamental problem of modeling a function 
from a random sample has at least two important ap- 
plications: multi-dimensional adaptive MC integration 
(for a review see [1]) and construction of quasi- 
optimal observables for data analysis [2]. 

The conventional view is that a function is a way to 
provide a number / (x) for any number x . However, 

with X measured via a finite precision measurement 
procedure, increasing the number of (unbiased, inde- 
pendent) measurements increases the precision of the 
estimate of x (by taking the standard average) — but 
for / one only obtains the average j / {x)(p{x)dx 

where (p is the probability distribution for individual 
measurements of x. By improving the measurement 
procedure one makes (p more 
narrow and thus can approach 
I / {^)(p{^)dx /(xq) — but only 

if / is continuous at Xq. Otherwise Xq x 

the result depends on the shape of (p. However, in 
practice one rarely if ever cares about how the function 
is defined at the points of discontinuity. 

Therefore, it is logically sufficient to define a func- 
tion by its averages {f,(p) = \ f{x)(p{x)dx with all 

possible test functions (p that possess continuous de- 
rivatives of any order (this is a technical restriction 



imposed for technical convenience without loss of 
meaning) and are equal to zero outside bounded re- 
gions (each (p has its own such region). The averages 
(/, (p) are linear in (p, and one can define a "general- 
ized function" as an arbitrary linear correspondence 
/ \(p ^ l^f,(p)\ a convenient abuse of notation is to 
write {f,(p) = I f(x)(p(x)dx . A familiar example is 
Dirac's J-function. 

To discuss approximations, one must specify the 
meaning of the proposition that a sequence of (gener- 
alized) functions converges to a (generalized) func- 
tion /. The convergence motivated by the conventional 
definition of functions is the pointwise convergence, 
i.e. a convergence (with uncorrelated rates) of all nu- 
merical sequences (x) f (x) for all x . Within the 

framework of the new interpretation, the true argument 
of a function is not x but g), and the notion of conver- 
gence is modified accordingly: {f^,(p) -^{f,(p) for 

all (p, with uncorrelated rates. In practice n stays finite, 
and one ensures a smallness of the differences 
, (p) - (/, (p) for a finite set of (p = (pj^. 



We write f^- 



weak 



-> f and use the qualifier 



"weak" to describe this type of convergence and re- 
lated notions (closeness, etc.). As a first heuristic ap- 
proximation, one may rely on the analogy between the 
weak convergence and metric convergences. 

There are many advantages in replacing the archaic 



"general functions", i.e. mappings f(x), with the 
subtler notion of "generalized functions", i.e. linear 
mappings (p^{f,(p), in our mental arsenal of 

mathematical concepts [3]. It is remarkable that such a 
finesse of interpretation results in truly powerful new 
options for constructive problem solving. In the con- 
text of particle physics, one example is the long-sought 
solution of the problem of asymptotic expansions of 
Feynman diagrams [4], which required an essential use 
of techniques of generalized functions (see a discus- 
sion in [5]). Another example is the discovery of the 
optimal jet definition [6] where the optimal configura- 
tion of jets is regarded as an approximation in the 
sense of weak convergence. It turns out that the latter 
idea has a much wider range of applicability, as is 
shown below. 

Consider a random sample of values {-^^l^j of a 
random variable x distributed according to 7r(x) > . 
The opening idea of the theory of statistics is that for 

^ oo , the sample reproduces the probability distri- 
bution 7r(x) . What would be a precise interpretation 
for that? More generally, given a random sample 

= K ' fn }n=\ ' w^^^^ fn=f(^n)^ ^hat ' s the pre- 
cise meaning of the statement that represent 
/ (x)7r(x) increasingly well for ^ oo ? 

Represent as a sum of J-fiinctions: 

F^ (x) = fn^(^~^n^ ' Then the required pre- 

cise interpretation is as follows: 

W ^-t ' f(^Mx), (1) 

i.e. for any test function (p, the sequence of its inte- 
grals with the l.h.s. converges to its integral with the 

r.h.s., I f{x)7i{x)(p{x)dx , in the usual sense. 

I am not aware of a textbook that would state this in 
an explicit fashion. This may be explained by the fact 
that mathematical statistics had aheady matured [7] by 
the time the ideas of generalized functions only began 
to be publicised [8]. It is important to clearly under- 
stand, however, that the interpretation (1) is an essen- 
tial starting point for all the thinking about how to 
obtain more tractable approximations for the r.h.s. in- 
stead of the l.h.s.; the latter, however, is the only pos- 
sible starting point (perhaps, along with some a priori 
information about /;r that can be used to fine-tune the 
algorithms). 



We will call such constructive approximations 
models, generically denote them as M(x), and require 
that a model M{x) provide constructive algorithms for: 

A) the mapping x M(x) for real x ; 

B) generation of random x distributed according to 
M(x), provided the latter is non-negative. 

So, one has to find a model ^ (x) that would be 

close in the weak sense to the "raw" approximation 
F^ (x) , which fact would guarantee that y (x) 

remains close to / {x)7r(^x) in the weak sense: 



weak 




N 



Moreover, if we impose restrictions on ^ (x) in 

accordance with whatever a priori information we may 
have about y (x) , we may hope that f (x) is 



close to y (x) in a stronger sense (uniform, etc.). 
Mathematical results of this type are well known [9]. 

In practice, the following types of models are used: 

(i) Decompositional models: the function' s domain 
of definition (D is split into non-intersecting subdo- 
mains, ® = Uy^ , n(Dj^^ = , and the model is 
defined to be constant in each subdomain: 
M(x) = const . can either be fixed (as with 

standard histograms) or found adaptively. 

(ii) Galiorkin models: M(x) = ^k^k (^) ' where 
(P^(x) is an orthogonal system of functions, and 
m,^ = j<ixF^(x)(P^(x) . 

(iii) Parametric models: one chooses a function pa- 
rameterized by a number of parameters and adjusts the 
latter to fit the sample. 

(iv) The Vegas model is employed in the Vegas rou- 
tine for multidimensional integration [10]. It is a direct 
product of one-dimensional adaptive decompositional 
models, M(x) = Ylf ^ji^j) • The popularity of Vegas 

shows that even such very crude approximation can be 
a valuable model in many dimensions. 

(v) Kernel models. Let K(x) be any convenient 
(usually hat-like) function such that 



KJ,(X): 



^.R-^'^K(R-^x)- 



weak 



^S(x). 



(2) 



Then it is sufficient to replace the individual S- 
functions 5{x-x^) in F^{x) with Kj^(x-x^) . In 
general, R should be smaller for larger N. 

(vi) NN models. The most popular simplest neural 
networks [11] are described by the analytical expres- 
sion M(x) = g{llk ^kS{l^j Aj^j wheregisa 

smooth step-like function. If one drops the outermost 
g (which is irrelevant in the present context), there 
remains a linear combination of rotated and shifted 
step-like functions. This is to be compared with the 
kernel models: rotated and shifted step-like functions 
roughly correspond to infmite-7? kernels positioned at 
infinite points in various directions. 

The kernel models (v) remain, perhaps, least stud- 
ied. The approach seems to become impractical for 
large N — the case which is often the most interesting. 
It could be advantageous to "condense" the sum of 8- 
functions to a (i) smaller number of (ii) more regularly 
distrubuted J's. This must be done so as to ensure a 
weak closeness of the condensed sum to the original 
one: 

weak 1 

W - -lLfp5{x-%) + fo^Fp{x)^ (3) 
p 

i.e. so as to minimize the differences 

\{Fj,f)-{Fpf)\ (4) 
for test functions (p. Then the scheme becomes 

(5) 

The replacement (3) based on minimization of (4) is 
exactly what is effected in the optimal jet definition 
[6]. Repeating the reasoning of [6] with appropriate 
simple modifications, one arrives at the following cri- 
teria for finding x^ and : 

One introduces an x matrix, < ^ < 1 , 

Z^^\-Y.pZ^^^>^, and sets fp=l.n^n,pfn^ 

fpXp ^^n^n^pfn^n^ SO that z^^ bccomcs the un- 
known. The matrix z is found from the requirement of 
minimization of the following expression: 



1 2 

^ p^n n 

This controls the remainder of a Taylor expansion of 
(4) in x^-Xp, with the leading terms nullified by the 

above restrictions on and . The parameter R 

remains fi'ee. 

The minima of Q.r correspond to configurations 
with z p equal to or 1. The set of sample points 

with z^^ =1 constitutes the p-thpoi/zyr \ x^ and 

are the pouzyr's location and weight (or charge). The 
domain spanned by a pouzyr is always convex, cen- 
tered at Xp , and its radius does not exceed 7?. 

It should be empasized that the criterion of mini- 
mizing ^li? is a constructive expression of the require- 
ment to make the original configuration of sample 
points and the resulting configuration of pouzyry as 
close as possible in the weak sense. 

It is remarkable that the first step of the model con- 
struction within the pouzyry approach (the left down- 
ward arrow in (5)) is performed within the realm of 
singular generalized functions. In this respect the 
pouzyry scheme is entirely novel and unusual. 

An exploratory algorithm to minimize Q.r was ob- 
tained by modifying the Optimal Jet Finder [12]. To 
ensure mathematical correctness of the resulting modi- 
fied algorithm and that subtle bugs are not introduced 
during the modification, I chose to work with the stati- 
cally safe, modular, object-oriented programming lan- 
guage Component Pascal [13], and chose the verifica- 
tion version of the optimal jet finder [14], designed for 
robustness, as a starting point for modifications. 

I report the following initial findings: 

1) One has to choose P and R prior to running the 
minimization. From general considerations, P should 

not be chosen larger than 0{^Jn^ (which is large 

enough for practical purposes). An optimal choice of 
R depends on the function; it may e.g. correspond to a 
typical distance over which first order derivatives of 
the function vary appreciably. 

2) The key element of the minimum search algo- 
rithm here is an iteration step which scans all sample 
points once and modifies the configuration of pouzyry. 
This iteration step is the only highly technical element 
of the algorithm not really subject to variations (apart 



^ Pouzyr means "bubble" in Russian, the plural being pouzyry, to be 
pronounced according to the rules of French. 



from possible optimizations). All other elements allow 
variations. For instance, the shapes of kernels are arbi- 
trary. Similarly, each kernel in the resulting sum can 
be given its own R depending on the effective radius 
of the corresponding pouzyr. Such variations should 
be employed to incorporate the properties of the solu- 
tion to maximal degree. 

3) The time required to execute a single iteration 
step is the same as in the case of the Optimal Jet 
Finder [12], O(PxN) , i.e. linear in both the number 

of sample points and the number of pouzyry. The CPU 
time per iteration is a fraction of a second on a 866 
MHz computer for dim = 7, TV = 200, P=10. 

4) The resulting configuration of pouzyry depends 
on the initial one (the starting point for minimum 
search). With a purely random choice (all random) 
all the pouzyry are initially located near the middle of 
the integration domain (the effect of averaging). This 
may not be optimal. It may help to devise smarter 
ways to choose the initial configuration. 

5) In some cases (e.g. for large R) one may observe 
the following behavior: at first the configuration of 
pouzyry converges pretty fast, but then the conver- 
gence slows down greatly; the configuration may 
change significantly over O(IOO) iterations. This 
means that a straightforward minimization may not be 
an optimal strategy in the more complex cases. 

6) Several ideas to improve upon the simplest 
pouzyry scheme suggest themselves, namely: (a) to 
"breed" better configurations of pouzyry using e.g. the 
ideas of genetic algorithms; (b) to make R depend on p 
in the formula for Q,r ; (c) to seek a model which is a 
sum of pouzyry configurations with various R : contri- 
butions with larger R would describe larger-scale be- 
havior of the function, whereas contributions with 
smaller R would describe narrow structures. 

7) Since O JF reliably finds narrow clusters [12], the 
pouzyry scheme is guaranteed to reliably find narrow 
spikes in the initial sample. More generally, the better 
a narrow structure can be approximated by a sum of 
spikes, the better pouzyry would work. 

8) As an example, consider the function 
/ (x) = MAX(1 - r/ p, 0) , where r is the euclidean dis- 
tance from X to the diagonal of the hypercube, and p 
describes the width of the diagonal strip within which 
/ is non-zero. For dim = 1, p = 0.l,N = 200 (only sam- 
ple points with non-zero values of f were retained and 
counted), P = lO,R = 0.l,it takes about 1 5 iterations to 
reach a minimum [15]. The quality of the resulting 
model can be judged from how much better is the MC 
integration with the probability distribution given by 



the model compared with the simplest MC with uni- 
formly distributed sample points. In the present case, 
despite the obviously too low P (not enough to cover 
the entire diagonal at the chosen R), the statistical in- 
tegration error is improved by a factor of 7 which 
translates into a factor of 49 reduction of CPU time. 
Note that the Vegas algorithm is defeated by such di- 
agonal structures, whereas the pouzyry scheme is in- 
sensitive to the choice of coordinates. On the other 
hand, with an unfortunate choice of P or R, the 
pouzyry scheme may not yield any significant im- 
provement over the simplest Monte Carlo integration. 
9) Pouzyry, which deals well with narrow struc- 
tures, could be combined with other methods well 
suited for description of the smooth global structure of 
the function, e.g. the Galiorkin methods. Note that the 
pouzyry scheme does contain a constant background 
already (^ in (3)) 

The described pouzyry scheme is novel and rather 
unusual, and given all the algorithmic options it opens, 
it is hard to assess its potential at present. I should be 
content to have demonstrated its feasibility. 
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